ABSTRACT


Multimedia content, especially video, is growing at an unprecedented rate and consumes ever more bandwidth when transmitted. Video compression improves transmission performance by reducing the size of the original video. Conventional video compression methods use a classical architecture that encodes motion and residual information efficiently, but they lack the ability to form non-linear representations of data. In this paper, we propose a framework named Artificial Intelligence (AI) enabled Video Compression Framework (AIVCF), which retains the traditional classical architecture and combines it with a deep learning model for non-linear data representation. The framework supports joint optimization of its underlying components. A Convolutional Neural Network (CNN) reconstructs the current frames from motion information obtained through optical flow estimation, and the information of the given video is compressed by deep learning models arranged in auto-encoder fashion. The framework strikes a balance between quality and compression ability. An algorithm named Deep Joint Optimization for Video Compression (DJO-VC) is proposed to realize the AIVCF. The proposed framework is evaluated with an empirical study. The experimental results, in terms of PSNR and SSIM, show that the proposed framework outperforms existing codecs such as H.264 at low bit rates.


Keywords — Video Compression, Deep Learning, Convolutional Neural Network, Artificial Intelligence 
Enabled Video Compression Framework 


1. INTRODUCTION


Deep learning based approaches have paved the way for solving many real world problems. They are widely used in computer vision applications because learned solutions suit video and image processing problems such as super resolution, action recognition and compression, to mention a few. Deep learning has thus become an indispensable approach for nonlinear signal processing. Moreover, recent works show that learned models achieve significant improvements in perceptual quality measures when compared with the state of the art [1].

The literature contains many deep learning models for video compression. Ma et al. [2] opined that CNN has the potential to solve problems associated with signal processing. Yang et al. [3] proposed a compression technique known as Recurrent Learned Video Compression (RLVC). Liu et al. [6] explored many CNN based models for solving video compression problems. Pessoa et al. [10] proposed a deep learning based framework for video compression with end to end learning by exploiting spatio-temporal auto-encoders. Zhang et al. [13] proposed a CNN based methodology for post processing towards video compression. They explored a Generative Adversarial Network (GAN) architecture comprising a generator (G) and a discriminator (D) for efficiency in video compression. Nagaraj et al. [17] used a deep learning technique, LSTM, to improve feature extraction and applied it to data compression. The related works show that existing compression methods use only a few reference frames to compress a video frame, which limits their ability to extract temporal correlation among different video frames. To address this problem, we propose a deep learning based framework. Figure 1 shows the outline of our approach for video frame prediction.


Figure 1: Overview Of Video Frame Prediction Process (blocks: existing image codec, binary motion encoding, conditioning network, predicted video frames)


Figure 1 illustrates our approach to the video prediction process. It makes use of reference frames and referencing frames, which are passed through the motion encoder, binary motion encoding and the decoder to predict video frames. The process also uses an existing image codec and a conditioning network. More details of the proposed approach are provided in Section 3. Our contributions in this paper are as follows.


1. A framework named Artificial Intelligence (AI) enabled Video Compression Framework (AIVCF) is proposed. It exploits the traditional classical architecture and combines it with a deep learning model for non-linear data representation.

2. An algorithm named Deep Joint Optimization for Video Compression (DJO-VC) is proposed to realize the AIVCF.

3. A prototype application is developed to evaluate the proposed framework and underlying algorithm.


The remainder of the paper is structured as follows. Section 2 reviews the latest related works on deep learning based video compression techniques. Section 3 presents our framework and algorithm. Section 4 gives details of the experimental setup. Section 5 presents experimental results, while Section 6 concludes our work and specifies future scope.


2. RELATED WORK 


This section reviews the latest related works on deep learning based video compression techniques. Ma et al. [2] opined that CNN has the potential to solve problems associated with signal processing. They emphasized that cutting edge video compression techniques are possible with deep learning models, as these can exploit the parallel computing supported by Graphical Processing Units (GPUs) and Tensor Processing Units (TPUs). Yang et al. [3] proposed a compression technique known as Recurrent Learned Video Compression (RLVC). RLVC makes use of a Recurrent Probability Model (RPM) and a Recurrent Auto-Encoder (RAE). It is a learned video compression technique that can extract temporal correlations among frames. However, it still has limitations in rate-distortion performance and complexity. Lu et al. [4] proposed an end-to-end framework for video compression using deep learning. It makes use of pixel-wise motion information and an auto-encoder, with joint optimization that considers the rate-distortion trade-off, and it exploits the non-linear representation capability of deep neural networks (DNNs). Chen et al. [5] proposed a methodology for video compression using deep feature coding and a lossy compression technique. It enables cloud based visual analysis by reducing overhead with a novel data transmission strategy.

Liu et al. [6] explored many CNN based models for solving video compression problems. They suggested deepening the learning process with variants of CNN for further improvement in compression performance. Xu et al. [7] made a comparative study of traditional methods and deep learning based approaches for compressing videos. They found that end to end learning and the use of different learning based entropy methods could improve compression performance. Westland et al. [8] exploited decision trees in order to reduce complexity in the process of video compression. Friedland et al. [9] investigated the influence of perceptual compression on deep learning models. Their empirical study found that deep learning models have the capability to exploit perceptual compression. They advocate the importance of using novel metrics rather than tuning hyperparameters. Pessoa et al. [10] proposed a deep learning based framework for video compression with end to end learning by exploiting spatio-temporal auto-encoders. It has provision for rate-distortion optimization to reduce inconsistencies among video frames. They achieved a latent space representation by capturing spatio-temporal dependencies. Poyser et al. [11] explored CNN architectures and investigated the impact of lossy video compression methods on them. They found that lossy compression can affect the performance of deep learning models. Valenzise et al. [12] focused on deep learning based approaches for image compression. They made a subjective evaluation of two deep CNN models for image compression and found that both improve on traditional methods.


Zhang et al. [13] proposed a CNN based methodology for post processing towards video compression. They explored a Generative Adversarial Network (GAN) architecture comprising a generator (G) and a discriminator (D) for efficiency in video compression. Chen et al. [14] proposed a compression model to compress deep learning models for ease of transmission over the Internet. Liu et al. [15] proposed a deep learning model for distortion prediction in image compression use cases. Birman et al. [16] investigated various deep learning models, including CNN, auto-encoder and GAN, for video compression. Nagaraj et al. [17] used a deep learning technique, LSTM, to improve feature extraction and applied it to data compression. Krishnaraj et al. [18] considered an IoT use case known as the Internet of Underwater Things (IoUT). In that environment, they implemented real-time image compression using a DWT-CNN model. Das et al. [19] explored JPEG compression and deep learning models to add security to images. Chen et al. [20] proposed a methodology for knowledge as a service for automatic compression of images using deep learning.


Prakash et al. [21] proposed a novel CNN architecture to achieve semantic perceptual image compression. In the process, they exploited a multi-structure Region of Interest (ROI). Wiedemann et al. [22] proposed a common compression technique using deep learning and named it DeepCABAC. It has provision to reduce rate-distortion and also includes a novel quantization scheme. Vega et al. [23] proposed a deep learning method for examining the quality of live video streaming. Duan et al. [24] investigated the notion of collaborative compression with video coding approaches. Chen et al. [25] proposed a deep feature compression technique for intelligent sensing. Kuanar et al. [26] focused on HEVC in-loop filtering using deep learning for improving the quality of the decoder. Li et al. [27] proposed a deep learning model based on Trellis Coded Quantization for image compression. Other contributions found in the literature include HEVC intra-frame coding with deep learning [28] and a Temporal 3-D CNN based method for video compression [29]. Table 1 summarizes the most relevant related works on deep learning based video compression. From the literature, it is understood that the conventional compression methods use only a few reference frames to compress a video frame, which limits the ability to extract temporal correlation among different video frames. Deep learning models improve on this because they support a non-linear approach. However, further research is needed to arrive at a more robust approach to video compression using deep learning.

Table 1: Summary Of Most Relevant Deep Learning Models For Video Compression

Reference | Approach | Algorithm/Technique | Dataset | Limitations
Xu et al. [7] | ... neural network | ... based model | ... | ... baseline
Nagaraj et al. [17] | Deep learning and feature extraction | LSTM | MNIST | Error rate is more.
Krishnaraj et al. [18] | Deep learning based on DWT | DWT-CNN | UWSN | It has issues with noisy environment.
Wiedemann et al. [22] | DNN based Universal Compression | Context-based Adaptive Binary Arithmetic Coder (CABAC) | ImageNet, CIFAR-10, MNIST | Achievable compression limits are to be investigated.
Duan et al. [24] | Deep ... with collaborative compression | Video ... | PKU ... | It has ... problem.
Sinha et al. [29] | CNN based approach | Temporal 3-D CNN based encoder and Y-style CNN based decoder | UCF101, Kinetic-5K and UVG | Lower visual quality and loss of motion information.

Table 2: Notations Used In The Paper

Notation | Description
I | reference frames
P, B | referencing frames (P-frames and B-frames)
E | encoder
D | decoder
Cond | conditioning network
M | mask
L | integer levels
V_g | ground truth flow
V_p | the flow vectors derived from the frames
EPE | end-point error
L_D | reconstruction loss
L_(?) | loss
λ | hyperparameter
L_F | the optical flow losses
α | weighting term
3. PROPOSED FRAMEWORK 


We propose a framework named Artificial Intelligence (AI) enabled Video Compression Framework (AIVCF). It comprises several mechanisms and an underlying algorithm for efficient video compression. The framework combines the conventional architecture with a deep learning model, a CNN, for non-linear data representation. The CNN reconstructs the current frames through optical flow estimation, which supplies the motion information. An auto-encoder based deep learning model compresses the information of the given video. Compressing a video in this way requires deep motion estimation and frame prediction. Figure 2 shows the architectural overview for predicting P-frames. The input video frames are subjected to encoding and decoding operations in order to predict P-frames. The motion encoder takes the input video frames and automatically compresses the motion information among them, generating a binary motion code. Each input to the encoder contains a reference frame, denoted I, and a referencing B- or P-frame. The binarization performed by the motion encoder is based on thresholding and uses the binarization function discussed in [30]. During training, the output of the motion encoder is a binary value with noise added; the value is either -1 or 1. Gradients are estimated using the procedure provided in [31].
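The thresholding step can be illustrated with a minimal sketch. It assumes encoder outputs already lie in [-1, 1] and models the training-time noise as stochastic rounding; that noise model and the function name `binarize` are assumptions made for illustration, not the exact function of [30].

```python
import random


def binarize(values, training=False, rng=None):
    """Map encoder outputs in [-1, 1] to binary codes in {-1, 1}.

    Assumed behaviour (illustrative only):
    - inference: hard threshold at 0;
    - training: stochastic rounding, so the added noise has zero mean.
    """
    rng = rng or random.Random(0)
    codes = []
    for x in values:
        if training:
            # +1 with probability (1 + x) / 2, otherwise -1.
            codes.append(1 if rng.random() < (1.0 + x) / 2.0 else -1)
        else:
            codes.append(1 if x >= 0.0 else -1)
    return codes


hard_codes = binarize([-0.3, 0.0, 0.7])
noisy_codes = binarize([-0.3, 0.0, 0.7], training=True)
```

Because the hard threshold has zero gradient almost everywhere, a surrogate gradient is needed during back-propagation, which is the role the text assigns to the procedure in [31].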


Figure 2: Architectural Overview Of P-Frame Prediction Process (blocks: input video pictures, motion encoder, binary motion code, conditioning network, P-frame decoder, P-frame predictions)


The features of the I-frame are extracted at the decoder using the conditioning network. Guided by the binarized motion encoding, the extracted features are used to predict P-frames. Image compression itself is done by an existing codec, not by the conditioning network. The P-frame prediction procedure is expressed in Eq. 1. Table 2 lists the notations used in this paper.

$$\hat{P} = D\big(E(I, P),\ \mathrm{Cond}(I)\big) \tag{1}$$


The decoder D exploits the reference frames I with the help of the conditioning network and is thus able to predict the sequence of P-frames; in other words, the decoder performs extrapolation. The encoder, on the other hand, always compresses its inputs. The bit rate of P-frame prediction is determined by the number of output channels used in the encoding layer.
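The link between output channels and bit rate can be made concrete with a small calculation. The sketch below assumes one bit per binary code element and no further entropy coding; the code-tensor shape (channels, frames, height, width) and the downsampling factor in the example are hypothetical, and the cost of the I-frame itself is excluded.

```python
def bits_per_pixel(code_shape, video_shape):
    """Bits per pixel of a binary motion code.

    code_shape  -- (channels, frames, height, width) of the binary code
    video_shape -- (frames, height, width) of the referencing frames
    Assumes each code element costs exactly one bit.
    """
    channels, code_frames, code_h, code_w = code_shape
    frames, height, width = video_shape
    code_bits = channels * code_frames * code_h * code_w
    pixels = frames * height * width
    return code_bits / pixels


# Hypothetical example: 16 referencing frames of 64x64 pixels, with the
# encoder downsampling each spatial dimension by a factor of 8.
bpp_8_channels = bits_per_pixel((8, 16, 8, 8), (16, 64, 64))
bpp_16_channels = bits_per_pixel((16, 16, 8, 8), (16, 64, 64))
```

Under these assumptions, doubling the number of output channels doubles the bit rate, which is why the channel count acts as the rate control.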
Figure 3: Architectural Overview Of B-Frame Prediction Process


Figure 3 illustrates the process involved in B-frame prediction. As in P-frame prediction, the motion encoder takes the input video frames, automatically compresses the motion information among them and generates a binary motion code. Each input contains reference frames, denoted I, and a referencing B-frame. The binarization is again based on thresholding and uses the function discussed in [30]. During training, the encoder output is a binary value, either -1 or 1, with noise added, and gradients are estimated using the procedure provided in [31]. The features of the I-frames are extracted at the decoder using the conditioning network and, guided by the binarized motion encoding, are used to predict B-frames. Image compression itself is done by an existing codec, not by the conditioning network. The B-frame prediction procedure is expressed in Eq. 2.

$$\hat{B} = D\big(E(I_t, B, I_{t+1}),\ \mathrm{Cond}(I_t, I_{t+1})\big) \tag{2}$$


The decoder D exploits the reference frames I with the help of the conditioning network. It is thus able to predict the sequence of B-frames by interpolation, unlike the decoder in the P-frame prediction process, which extrapolates. The encoder, on the other hand, always compresses its inputs. The bit rate of B-frame prediction is determined by the number of output channels used in the encoding layer. For both processes, shown in Figure 2 and Figure 3, the L2 reconstruction loss is computed in the training phase as expressed in Eq. 3.

$$L_D = \lVert B - \hat{B} \rVert^2 \quad \text{or} \quad \lVert P - \hat{P} \rVert^2 \tag{3}$$
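As a minimal illustration of Eq. 3, the squared L2 distance between a predicted frame and its target can be computed as below. Frames are represented as flat lists of pixel values purely for simplicity; the function name is hypothetical.

```python
def l2_reconstruction_loss(predicted, target):
    """Squared L2 distance between two frames given as flat pixel lists."""
    if len(predicted) != len(target):
        raise ValueError("frames must contain the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


loss = l2_reconstruction_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```

The same function applies to P-frames and B-frames, since Eq. 3 uses the same form for both.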


During training, the decoder is given access to the I-frame content (an I-frame represents an entire image in the video). At test time, however, encoding and decoding take place independently with the help of an image codec. Multi-scale convolutional layers, as discussed in [32], are preferred in the prediction process because motion in a given video occurs differently at different scales. Each convolutional layer can exploit a learned "scale invariant feature transform (SIFT)". The conditioning at the decoder enables the given architectures to predict the frame correctly. Compared with raw video frames, the binary motion codes obtained in the prediction process are more compressible. The proposed designs for predicting P- and B-frames support different frame sizes and different numbers of pictures in the given video.


Algorithm 1: Deep Joint Optimization For Video Compression (DJO-VC)


Algorithm: Deep Joint Optimization for Video Compression (DJO-VC)
Input: Video V containing a set of pictures
Output: Compressed video V'

1. Start
2. Initialize P-Frames vector X
3. Initialize B-Frames vector Y
4. Initialize binary motion code vector M
5. I ← GenerateIFrames(V)
Detection of P-Frames
6. For each I-frame i in I
7.    For each reference and referencing frame r in R
8.       M ← MotionEncoder(r)
9.       If CondDecoderExtrapolation(M) → P-Frame Then
10.         Add M to X
11.      End If
12.   End For
13. End For
Detection of B-Frames
14. For each I-frame i in I
15.    For each reference and referencing frame r in R
16.       M ← MotionEncoder(r)
17.       If CondDecoderInterpolation(M) → B-Frame Then
18.          Add M to Y
19.       End If
20.    End For
21. End For
22. V' ← GenerateOutput(I, X, Y)
23. Compute loss functions
24. Performance evaluation
25. Display statistics
26. Return V'

Algorithm 1 takes a given video as input and generates a compressed video. It uses deep CNN based multi-scale convolutional layers in the prediction of P- and B-frames. The algorithm covers the prediction of both P-frames and B-frames, with automatic compression, before generating the final compressed video that is transmitted over networks. The motion encoder compresses the motion information from the given video pictures and represents the data as values of -1 or 1. The decoder used for P-frames relies on extrapolation, while the decoder used for B-frames relies on interpolation.


To make the generation of binary motion codes flexible, we incorporate the time dimension using the approach presented in [33]. It helps in adapting the bit rate to different regions of the video and the content in those regions. The encoder identifies spatio-temporal locations and allocates a fixed number of bits. The underlying motion encoder uses a number of bit channels that depends on the point in space-time. In the process a bit distribution map, denoted Bmap, is created. The encoder produces bits for each video frame and they are divided into L groups. Each Bmap element is denoted b_{t,h,w} and is quantized as expressed in Eq. 4.


$$Q_L(b_{t,h,w}) = \lceil L\, b_{t,h,w} \rceil \tag{4}$$
(4) 


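For each space-time point, it determines the number of bit levels needed. A bit mask is then generated in order to avoid allocating non-integer numbers of bits. It is expressed as in Eq. 5, where c is the channel index and C is the number of bit channels.


$$M_{c,t,h,w} = \begin{cases} 1, & \text{if } c \le \frac{C}{L}\, Q_L(b_{t,h,w}) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$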
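The quantization and masking steps can be sketched for a single space-time point as follows. The sketch assumes that the Bmap value b lies in [0, 1] and that channels are indexed from 1; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def quantize(b, levels):
    """Quantize a Bmap value b in [0, 1] to an integer level in 0..levels."""
    return math.ceil(levels * b)


def channel_mask(b, levels, channels):
    """Binary mask over channels 1..channels for one space-time point.

    Channel c is kept when c <= (channels / levels) * quantize(b, levels).
    """
    kept = (channels / levels) * quantize(b, levels)
    return [1 if c <= kept else 0 for c in range(1, channels + 1)]


# With L = 4 levels and C = 8 channels, b = 0.3 maps to level 2,
# so the first (8 / 4) * 2 = 4 channels carry bits.
mask = channel_mask(0.3, levels=4, channels=8)
```

Because the number of kept channels is always a multiple of C / L, every location receives an integer number of bits, which is the purpose the text gives for the mask.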


In order to ensure that the decoder interprets the bit stream correctly, an additional loss term is computed as in Eq. 6.


$$L_B = \sum_{t,h,w} b_{t,h,w} \tag{6}$$


This loss term discourages assigning bits to stationary video regions, which can be ignored given the I-frame. The operations in Eq. 4 and Eq. 5 are non-differentiable. In order to train the dynamic bit assignment, their gradient is approximated as expressed in Eq. 7.


$$\frac{\partial M_{c,t,h,w}}{\partial b_{t,h,w}} = \begin{cases} L, & \text{if } L\, b_{t,h,w} - 1 \le \left\lceil \frac{cL}{C} \right\rceil < L\, b_{t,h,w} + 2 \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
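A surrogate gradient of this piecewise form can be sketched as below. The exact inequalities are an assumption: the sketch takes the gradient to be L when the level index of channel c falls in a small window around L·b and 0 otherwise, with 1-based channel indices.

```python
import math


def mask_gradient(c, b, levels, channels):
    """Surrogate gradient of the mask entry for channel c with respect to b.

    Assumed rule: the gradient equals `levels` when
    levels*b - 1 <= ceil(c*levels/channels) < levels*b + 2, else 0.
    """
    position = math.ceil(c * levels / channels)
    if levels * b - 1 <= position < levels * b + 2:
        return float(levels)
    return 0.0


# Same hypothetical setting as a 4-level, 8-channel mask with b = 0.3.
gradients = [mask_gradient(c, 0.3, 4, 8) for c in range(1, 9)]
```

In this example only the channels near the current cut-off receive a non-zero gradient, so training can move the cut-off up or down while distant channels are left untouched.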


We also explored a loss term based on optical flow for improving the motion compression process. Between two frames of a video, optical flow reflects the pixel movement, as discussed in [34]. The optical flow based loss in terms of end-point error is given in Eq. 8, and the cosine similarity is expressed in Eq. 9.


$$L_{EPE} = \lVert V_g - V_p \rVert_2 \tag{8}$$

$$L_{cosine} = \frac{V_g \cdot V_p}{\lVert V_g \rVert\, \lVert V_p \rVert} \tag{9}$$
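The difference between the two flow measures is easy to see on a single 2-D flow vector. The sketch below uses hypothetical function names and treats each flow vector as an (x, y) pair.

```python
import math


def end_point_error(v_true, v_pred):
    """Euclidean distance between a ground truth and a predicted flow vector."""
    return math.hypot(v_true[0] - v_pred[0], v_true[1] - v_pred[1])


def cosine_similarity(v_true, v_pred):
    """Cosine of the angle between two flow vectors (0.0 if either is zero)."""
    dot = v_true[0] * v_pred[0] + v_true[1] * v_pred[1]
    norm = math.hypot(*v_true) * math.hypot(*v_pred)
    return dot / norm if norm > 0 else 0.0


# Same direction, double the magnitude: large end-point error, perfect cosine.
epe_scaled = end_point_error((3.0, 4.0), (6.0, 8.0))
cos_scaled = cosine_similarity((3.0, 4.0), (6.0, 8.0))

# Same magnitude, perpendicular direction: cosine similarity drops to zero.
cos_perpendicular = cosine_similarity((3.0, 4.0), (4.0, -3.0))
```

The end-point error reacts to magnitude as well as direction, whereas the cosine measure reacts to direction only.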


The two measures, L_EPE and L_cosine, behave differently, as the latter penalizes directional deviations between the predicted vectors and the ground truth. After the models in Figure 2 and Figure 3 have been trained (yielding pre-trained models), further training is carried out to learn the dynamic bit assignment. The objective for this stage, which runs for 150 additional epochs, is expressed in Eq. 10.


$$L_{RD} = L_D + \lambda L_B \tag{10}$$


It combines the two losses computed in Eq. 3 and Eq. 6 into a single objective. In order to strike a balance between compression rate and reconstruction quality, we introduce a hyperparameter λ.


$$L_{FD} = L_D + \alpha L_F \tag{11}$$


The loss function in Eq. 11 is used to minimize the difference between the optical flow of the predicted frames and that of the input frames. Here the optical flow loss is denoted by L_F and the distortion loss by L_D. The performance of the proposed framework is evaluated using three objective metrics. Peak Signal to Noise Ratio (PSNR) measures the quality of the predicted video frames. Video Multi-Method Assessment Fusion (VMAF) [35] is the second metric, and the Structural SIMilarity index (SSIM) [36] is the third.
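Of the three metrics, PSNR is simple enough to state in a few lines. The sketch below uses the standard definition for 8-bit content (peak value 255) on flat lists of pixel values; it is an illustration, not the evaluation code used in the experiments.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized frames."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


# Every pixel is off by one grey level, so the mean squared error is 1.
value = psnr([10, 20, 30], [11, 19, 31])
```

A higher value means a smaller mean squared error relative to the peak signal, which is why higher PSNR is read as less distortion.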


4. EXPERIMENTAL SETUP 


The application and algorithm are implemented in Python 3 on a Python data science platform. The deep neural network architectures for the P-Frame and B-Frame detection procedures are built using PyTorch 1.0.1. Other important Python libraries used in the implementation are OpenCV, scikit-image and scikit-video. The networks are trained on the Hollywood dataset [57], which has 475 diverse video clips in AVI format. To be compatible with the data loader in the implementation, each clip is transcoded with the H.264 [5] codec. Out of the 475 video clips, we used 435 for training and 40 for validation. The initial learning rate is set to 0.0001, the optimizer is Adam, and the number of epochs used in the empirical study is 150.
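The reported settings can be collected in a small configuration object, sketched below. The field names, the clip file names and the seeded random split are assumptions for illustration; the text does not state how the 435 training and 40 validation clips were selected.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the experimental setup."""
    learning_rate: float = 1e-4
    epochs: int = 150
    optimizer: str = "Adam"
    num_train: int = 435
    num_val: int = 40


def split_clips(clip_names, config, seed=0):
    """Shuffle clips with a fixed seed and split them into train/validation."""
    if len(clip_names) != config.num_train + config.num_val:
        raise ValueError("unexpected number of clips")
    shuffled = list(clip_names)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:config.num_train], shuffled[config.num_train:]


config = TrainingConfig()
clips = [f"clip_{i:03d}.avi" for i in range(475)]  # hypothetical file names
train_clips, val_clips = split_clips(clips, config)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing the pre-trained and optimized variants on the same validation clips.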


5. RESULTS AND DISCUSSION 


The proposed learned video compression 
technique using deep learning is evaluated and 
compared with conventional codecs. Different 
performance metrics used for evaluation are 
PSNR, VMAF and SSIM. 


Figure 4: Result Of Pre-Processing To Obtain Set Of Pictures From Video


As shown in Figure 4, the given video is pre-processed into a set of pictures. These pictures are the input to the proposed deep learning approach, which performs the learned video compression.


Figure 5: Compressed Frames With Bit Rate Per Pixel 
0.2121 


The empirical study shows that the bit rate per pixel influences the visual quality of the compressed frames. In Figure 5, the pictures acquired from a video are subjected to deep learning based compression. The visual quality shown corresponds to a bit rate per pixel of 0.2121.

Figure 6: Compressed frames with bit rate per pixel 0.2176


As presented in Figure 6, the pictures acquired 
from a video are subjected to deep learning based 
compression. The visual quality visible here is 
with bit rate per pixel 0.2176. 


Figure 7: Compressed frames with bit rate per 
pixel 0.2597 


As presented in Figure 7, the pictures acquired 
from a video are subjected to deep learning based 
compression. The visual quality visible here is 
with bit rate per pixel 0.2597. 


5.1 Compression Performance with P- 
Frame Prediction 


This section presents the results of the empirical study using the proposed framework AIVCF, considering P-Frame prediction for video compression. It is also compared with video compression using B-Frame detection with optimization. The optimized version exploits dynamic bit assignment to improve compression efficiency. Experiments are made with different bits-per-pixel values and the performance is evaluated in terms of PSNR, SSIM and VMAF. In other words, a rate-distortion analysis is made and the observations are recorded.




Table 3: PSNR comparison between video compression with B-Frame detection and its optimized variant


Bits-Per-Pixel | PSNR, AIVCF (B-Frame Detection) | PSNR, AIVCF (B-Frame Detection) with Optimization
0.02 | 29.85 | 31.28
0.04 | 30.05 | 31.45
0.06 | 30.2 | 31.55
0.08 | 30.4 | 31.55
0.1 | 30.45 | 31.55
0.12 | 30.48 | 31.55


As presented in Table 3, video compression 
performance of B-Frame detection process and its 
optimized variant is compared against bit rate in 
terms of PSNR. 
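The gain from optimization at each rate follows directly from the values in Table 3. The short calculation below is a convenience sketch; the variable names are hypothetical and the numbers are copied from the table.

```python
bpp = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12]
psnr_base = [29.85, 30.05, 30.2, 30.4, 30.45, 30.48]
psnr_optimized = [31.28, 31.45, 31.55, 31.55, 31.55, 31.55]


def gains(base, optimized):
    """Per-rate improvement of the optimized variant, rounded to 2 decimals."""
    return [round(o - b, 2) for b, o in zip(base, optimized)]


psnr_gains = gains(psnr_base, psnr_optimized)
mean_gain = round(sum(psnr_gains) / len(psnr_gains), 2)
```

The gain shrinks from 1.43 at 0.02 bits-per-pixel to 1.07 at 0.12 bits-per-pixel, because the optimized variant saturates at 31.55 from 0.06 bits-per-pixel onward while the un-optimized variant keeps improving slowly.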


Table 4: SSIM comparison between video compression with B-Frame detection and its optimized variant


Bits-Per-Pixel | SSIM, AIVCF (B-Frame Detection) | SSIM, AIVCF (B-Frame Detection) with Optimization
0.02 | 0.844 | 0.878
0.04 | 0.849 | 0.883
0.06 | 0.852 | 0.884
0.08 | 0.857 | 0.884
0.1 | 0.86 | 0.884
0.12 | 0.864 | 0.884


As presented in Table 4, the video compression performance of the B-Frame detection process and its optimized variant is compared against bit rate in terms of SSIM.


Table 5: VMAF comparison between video compression with B-Frame detection and its optimized variant


Bits-Per-Pixel | VMAF, AIVCF (B-Frame Detection) | VMAF, AIVCF (B-Frame Detection) with Optimization
0.02 | 71.4 | 75
0.04 | 72.5 | 137
0.06 | 72.9 | 76.2
0.08 | 73.1 | 76.2
0.1 | 73.4 | 76.2
0.12 | 73.7 | 76.2


As presented in Table 5, the video compression performance of the B-Frame detection process and its optimized variant is compared against bit rate in terms of VMAF.


Figure 8: Rate-distortion analysis in terms of PSNR (horizontal axis: bits-per-pixel (bpp); vertical axis: PSNR)


Figure 8 plots PSNR against bits-per-pixel, with the rates on the horizontal axis. PSNR is computed at each rate to assess video compression performance. A higher PSNR value indicates less distortion and higher quality in compression. An important observation is that the bits-per-pixel rate influences PSNR. Another observation is that the optimized version of the B-Frame prediction process performs better than its un-optimized variant. At a rate of 0.02, the proposed framework with the B-Frame prediction process achieves a PSNR of 29.85, while its optimized version, which exploits dynamic bit assignment, achieves a PSNR of 31.28. This trend holds at all rates used in the experiments. Therefore, it can be concluded that the optimized version of the B-Frame prediction process performs consistently better than its un-optimized counterpart.
Figure 9: Rate-distortion analysis in terms of SSIM


Figure 9 plots SSIM against bits-per-pixel, with the rates on the horizontal axis. SSIM is computed at each rate to assess video compression performance. A higher SSIM value indicates less distortion and higher quality in compression. An important observation is that the bits-per-pixel rate influences SSIM. Another observation is that the optimized version of the B-Frame prediction process performs better than its un-optimized variant. At a rate of 0.02, the proposed framework with the B-Frame prediction process achieves an SSIM of 0.844, while its optimized version, which exploits dynamic bit assignment, achieves an SSIM of 0.878. This trend holds at all rates used in the experiments. Therefore, it can be concluded that the optimized version of the B-Frame prediction process performs consistently better than its un-optimized counterpart.
Figure 10: Rate-distortion analysis in terms of VMAF


Figure 10 plots VMAF against bits-per-pixel, with the rates on the horizontal axis. VMAF is computed at each rate to assess video compression performance. A higher VMAF value indicates less distortion and higher quality in compression. An important observation is that the bits-per-pixel rate influences VMAF. Another observation is that the optimized version of the B-Frame prediction process performs better than its un-optimized variant. At a rate of 0.02, the proposed framework with the B-Frame prediction process achieves a VMAF of 71.4, while its optimized version, which exploits dynamic bit assignment, achieves a VMAF of 75. This trend holds at all rates used in the experiments. Therefore, it can be concluded that the optimized version of the B-Frame prediction process performs consistently better than its un-optimized counterpart.


5.2. Performance Evaluation of P-Frame 
Detection Process 


This section evaluates the performance of the proposed learning based video compression using the P-Frame detection process against standard codecs such as H.265 and H.264. Rate-distortion analysis is made with the performance metrics PSNR, SSIM and VMAF. Video clips are sampled from the VTL dataset [40], where each clip is 64x64 with 17 frames: 16 referencing frames and one I-frame. Experiments are made with the proposed framework and the aforementioned existing codecs.


Table 6: PSNR performance comparison of P-Frame detection against H.264 and H.265


Bits-Per-Pixel | PSNR, AIVCF (P-Frame Detection) | PSNR, H.264 | PSNR, H.265
0.1 | 20 | 0 | 0
0.15 | 26.5 | 0 | 0
0.2 | 28 | 0 | 0
0.25 | 28.3 | 24.5 | 0
0.3 | 28.5 | 27.8 | 25.8
0.35 | 28.6 | 31 | 28.3
0.4 | 28.7 | 33.5 | 31.5


As presented in Table 6, PSNR performance of 
proposed framework AIVCF with P-Frame 
detection is compared against H.264 and H.265. 
Rate-distortion analysis is made with different 
bits-per-pixel values. 


Table 7: SSIM performance comparison of P-Frame detection against H.264 and H.265


Bits-Per-Pixel | SSIM, AIVCF (P-Frame Detection) | SSIM, H.264 | SSIM, H.265
0.1 | 0.5 | 0 | 0
0.15 | 0.82 | 0 | 0
0.2 | 0.83 | 0 | 0
0.25 | 0.84 | 0.75 | 0
0.3 | 0.85 | 0.87 | 0.78
0.35 | 0.86 | 0.93 | 0.88
0.4 | 0.87 | 0.95 | 0.92


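As presented in Table 7, the SSIM performance of
the proposed AIVCF framework with P-Frame
detection is compared against H.264 and H.265 at
different bits-per-pixel values. AIVCF leads up
to 0.25 bits-per-pixel (0.84 against 0.75 for
H.264), H.264 leads from 0.3 bits-per-pixel
onwards, and H.265 also overtakes AIVCF from
0.35 bits-per-pixel.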


Table 8: VMAF performance comparison of P-Frame 
detection against H.264 and H.265 


VMAF

Bits-Per-Pixel   AIVCF (P-Frame Detection)   H.264   H.265
0.1              30                          -       -
0.15             69                          -       -
0.2              71                          -       -
0.25             70                          56      -
0.35             72                          85      81
0.4              73                          88      85


As presented in Table 8, the VMAF performance of
the proposed AIVCF framework with P-Frame
detection is compared against H.264 and H.265.
Rate-distortion analysis is made with different
bits-per-pixel values.


Figure 11: PSNR performance comparison of P-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 11, observations are made
at the different rates shown on the horizontal
axis. The quality of the compressed video is
measured using PSNR, shown on the vertical axis.
It is observed that bits-per-pixel influences
PSNR. Each compression technique shows a
different level of performance due to its
underlying mechanisms. The proposed learning
based approach using P-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


Figure 12: SSIM performance comparison of P-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 12, observations are made
at the different rates shown on the horizontal
axis. The perceived quality of the compressed
video is measured using SSIM, shown on the
vertical axis. It is observed that bits-per-pixel
influences SSIM. Each compression technique
shows a different level of performance due to
its underlying mechanisms. The proposed learning
based approach using P-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


Figure 13: VMAF performance comparison of P-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 13, observations are made
at the different rates shown on the horizontal
axis. The perceived quality of the compressed
video is measured using VMAF, shown on the
vertical axis. It is observed that bits-per-pixel
influences VMAF. Each compression technique
shows a different level of performance due to
its underlying mechanisms. The proposed learning
based approach using P-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


5.3. Performance Evaluation of B-Frame
Detection Process


This section evaluates the performance of the
proposed learning based video compression using
the B-Frame detection process against standard
codecs such as H.265 and H.264. Rate-distortion
analysis is made with different performance
metrics such as PSNR, SSIM and VMAF. Video clips
are sampled from the VTL dataset [40], where each
clip is 64x64 pixels with 17 frames. Each clip
contains one I-frame and 16 referencing frames.
Experiments are made with the proposed framework
and the aforementioned existing codecs.
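To give a sense of scale for the rates used in these experiments, the bits-per-pixel figure can be converted into a total bit budget for one clip. The sketch below assumes that bits-per-pixel is averaged over every pixel of all 17 frames of a 64x64 clip, including the I-frame; the paper does not spell out this normalization, and the function names are hypothetical.

```python
def clip_bit_budget(bpp, width=64, height=64, frames=17):
    """Total bits available to a clip at the given bits-per-pixel rate."""
    return bpp * width * height * frames


def bits_per_pixel(total_bits, width=64, height=64, frames=17):
    """Inverse mapping: the rate implied by a coded clip size in bits."""
    return total_bits / (width * height * frames)


# Bit budgets for the rates that appear in the tables of this section.
for rate in (0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4):
    bits = clip_bit_budget(rate)
    print(f"{rate:.2f} bpp -> {bits:.1f} bits ({bits / 8:.0f} bytes)")
```

A 64x64 clip with 17 frames holds 69,632 pixels, so under this assumption a rate of 0.1 bits-per-pixel leaves about 6,963 bits, or roughly 870 bytes, for the whole clip.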


Table 9: PSNR performance comparison of B-Frame 
detection against H.264 and H.265 


Bits-Per-Pixel   AIVCF (B-Frame Detection)


As presented in Table 9, the PSNR performance of
the proposed AIVCF framework with B-Frame
detection is compared against H.264 and H.265.
Rate-distortion analysis is made with different
bits-per-pixel values.


Table 10: SSIM performance comparison of B-Frame 
detection against H.264 and H.265 


Bits-Per-Pixel   AIVCF (B-Frame Detection)


As presented in Table 10, the SSIM performance of
the proposed AIVCF framework with B-Frame
detection is compared against H.264 and H.265.
Rate-distortion analysis is made with different
bits-per-pixel values.


Table 11: VMAF performance comparison of B-Frame
detection against H.264 and H.265


VMAF

Bits-Per-Pixel   AIVCF (B-Frame Detection)   H.264   H.265
0.15             0                           -       -
0.2              50                          -       -
0.25             75                          58      -
0.3              76                          75      72
0.35             77                          83      84
0.4              78                          85      86


As presented in Table 11, the VMAF performance of
the proposed AIVCF framework with B-Frame
detection is compared against H.264 and H.265.
Rate-distortion analysis is made with different
bits-per-pixel values.


Figure 14: PSNR performance comparison of B-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 14, observations are made
at the different rates shown on the horizontal
axis. The quality of the compressed video is
measured using PSNR, shown on the vertical axis.
It is observed that bits-per-pixel influences
PSNR. Each compression technique shows a
different level of performance due to its
underlying mechanisms. The proposed learning
based approach using B-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


Figure 15: SSIM performance comparison of B-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 15, observations are made
at the different rates shown on the horizontal
axis. The perceived quality of the compressed
video is measured using SSIM, shown on the
vertical axis. It is observed that bits-per-pixel
influences SSIM. Each compression technique
shows a different level of performance due to
its underlying mechanisms. The proposed learning
based approach using B-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


Figure 16: VMAF performance comparison of B-Frame
detection with existing codecs H.264 and H.265


As presented in Figure 16, observations are made
at the different rates shown on the horizontal
axis. The perceived quality of the compressed
video is measured using VMAF, shown on the
vertical axis. It is observed that bits-per-pixel
influences VMAF. Each compression technique
shows a different level of performance due to
its underlying mechanisms. The proposed learning
based approach using B-Frame detection
outperforms the conventional techniques, but
only at low bit rates. At higher bit rates, its
performance is below that of H.264 and H.265. The
rationale behind this is that the proposed
framework does not consider compression of
residual information but focuses on motion
estimation. The performance improvement
therefore comes solely from the inter-frame
prediction approach in the proposed framework.


6. CONCLUSION AND FUTURE WORK 


In this paper, we proposed a framework named
Artificial Intelligence (AI) enabled Video
Compression Framework (AIVCF), which exploits
the traditional classical architecture and
combines it with a deep learning model for
non-linear data representation. This framework
supports joint optimization of its underlying
components. A Convolutional Neural Network (CNN)
is used to reconstruct current frames by
obtaining motion information through a process
known as optical flow estimation. The
information of a given video is compressed using
deep learning models in auto-encoder fashion.
The framework strikes a balance between quality
and compression ability. An algorithm named
Deep Joint Optimization for Video Compression
(DJO-VC) is proposed to realize the AIVCF. The
proposed framework is evaluated with an empirical
study. The experimental results, in terms of PSNR
and SSIM, revealed that the proposed framework
outperforms existing codecs such as H.264.
However, AIVCF showed better performance only at
low bit rates. When the bit rate is high, its
performance is not better than that of the
conventional methods. The rationale behind this
is that the proposed framework does not consider
compression of residual information but focuses
on motion estimation. The performance
improvement therefore comes solely from the
inter-frame prediction approach in the proposed
framework. In future work, we intend to improve
the framework to overcome this drawback, besides
considering other deep learning approaches.